Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add weekly hash check to NUTS file and XLSX to YAML utility function #40

Merged
merged 4 commits into from
Dec 11, 2024

Conversation

dc-almeida
Copy link
Collaborator

Closes #38
Adds a weekly GH action to compare the hash of a local NUTS file with the EUROSTAT website version
Adds a XLSX to YAML utility function for use when updating the local file to latest version

@dc-almeida dc-almeida added the enhancement New feature or request label Nov 25, 2024
@dc-almeida dc-almeida self-assigned this Nov 25, 2024
@dc-almeida dc-almeida changed the title Add utility to convert NUTS to YAML format Add weekly hash check to NUTS file and XLSX to YAML utility function Nov 25, 2024
@dc-almeida dc-almeida marked this pull request as ready for review November 25, 2024 09:17
Copy link
Contributor

@phackstock phackstock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One general thought from my side. Now that you save a copy of the NUTS file as part of this repository, you don't really need the yaml files anymore. I'd suggest to do away with the yaml files and just read the Excel file directly. This way you also save the additional step of converting the Excel to yaml in case there is any update. And you exclude the error case where the Excel file and the yaml files are out of sync.
@danielhuppmann, what are your thoughts on the matter?
In addition, one comment below in line.

.github/workflows/check_hash.yml Show resolved Hide resolved
@danielhuppmann
Copy link
Member

First question: is the performance difference between an xlsx spreadsheet versus the yaml files noticeable?

Second question: if Eurostat again changes the file (hosted on the same url), we will only be notified that the hashes don't match, without any guidance on the actual change... Having the xlsx-to-yaml utility can be a useful way to find the change and/or check whether the relevant for us.

@phackstock
Copy link
Contributor

First question: is the performance difference between an xlsx spreadsheet versus the yaml files noticeable?

Technically, yaml should be faster than xlsx. However, in practice, reading data is orders of magnitude slower so the difference does not matter.

Second question: if Eurostat again changes the file (hosted on the same url), we will only be notified that the hashes don't match, without any guidance on the actual change... Having the xlsx-to-yaml utility can be a useful way to find the change and/or check whether the relevant for us.

Ok, if Eurostat does not provide a changelog then I can see the value of comparing yaml files. We should still be careful though that the Excel file and the yaml files match. Otherwise we might get a wrong warning.

Copy link
Contributor

@phackstock phackstock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to be merged from my side.
The only thing that might be nice to add is a short markdown or rst file on how to update the NUTS regions in case the file changes again. It might seem straightforward now but in a year or two we are guaranteed to forget about it.

@dc-almeida
Copy link
Collaborator Author

Ok, if Eurostat does not provide a changelog then I can see the value of comparing yaml files. We should still be careful though that the Excel file and the yaml files match. Otherwise we might get a wrong warning.

YAML files are also comparable in GitHub changelogs, whereas XLSX files are not, so it allows tracking changes to the regions.

Copy link
Member

@danielhuppmann danielhuppmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks!

@dc-almeida dc-almeida merged commit 342913f into main Dec 11, 2024
5 checks passed
@dc-almeida dc-almeida deleted the feature/check-hash-weekly branch December 17, 2024 09:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Updated source file by EUROSTAT
3 participants